CODE-EHR best practice framework for the use of structured electronic healthcare records in clinical research

Abstract Big data is central to new developments in global clinical science aiming to improve the lives of patients. Technological advances have led to the routine use of structured electronic healthcare records with the potential to address key gaps in clinical evidence. The covid-19 pandemic has demonstrated the potential of big data and related analytics, but also important pitfalls. Verification, validation, and data privacy, as well as the social mandate to undertake research are key challenges. The European Society of Cardiology and the BigData@Heart consortium have brought together a range of international stakeholders, including patient representatives, clinicians, scientists, regulators, journal editors and industry. We propose the CODE-EHR Minimum Standards Framework as a means to improve the design of studies, enhance transparency and develop a roadmap towards more robust and effective utilisation of healthcare data for research purposes.


Abstract
Big data is central to new developments in global clinical science aiming to improve the lives of patients. Technological advances have led to the routine use of structured electronic healthcare records with the potential to address key gaps in clinical evidence. The covid-19 pandemic has demonstrated the potential of big data and related analytics, but also important pitfalls. Verification, validation, and data privacy, as well as the social mandate to undertake research are key challenges. The European Society of Cardiology and the BigData@Heart consortium have brought together a range of international stakeholders, including patient representatives, clinicians, scientists, regulators, journal editors and industry. We propose the CODE-EHR Minimum Standards Framework as a means to improve the design of studies, enhance transparency and develop a roadmap towards more robust and effective utilisation of healthcare data for research purposes.
In the context of ageing populations and increasing multimorbidity in all disease areas, 1-3 large scale, real world data provide an opportunity to better understand the epidemiology of rare and common conditions, and to improve prevention strategies and treatment stratification. 4 Tailored management for individual patients has become even more essential to constrain healthcare costs and provide patient centred care that can improve a patient's quality of life and prognosis. Embedding controlled trials within the real world setting, either within registries or routine clinical practice, is now possible and could provide more generalisable results to the population at large. 5 Health data science has undergone rapid development in the past decade, including the common adoption of electronic healthcare record (EHR) systems that condense clinical episodes into a set of coded, structured labels. 6 However, concerns over quality, data privacy, transparency, and comparability of these systems have limited the use of the evidence generated with structured healthcare data. These issues have also restricted acceptance by regulators, reimbursement authorities, and guideline task forces. Despite the availability of numerous reporting standards, consensus has not been met on how to realise the Findable, Accessible, Interoperable, and Reusable (FAIR) principles 7 in the context of structured healthcare data. Existing reporting checklists ask authors to indicate where in their paper particular design issues have been discussed. For example, STROBE (Strengthening the Reporting of Observational studies in Epidemiology) for observational studies, 8 RECORD (REporting of studies Conducted using Observational Routinely-collected Data) for routinely collected health data, 9 and CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence) for artificial intelligence interventions. 10 However, these checklists are often lengthy, no minimum standards are specified, and adherence does not relate to study quality or even the quality of transparency for that domain. 11 Although checklists can benefit research quality, they are often used for box ticking to facilitate journal publication. In a study of radiology journals, only 15% (120/821) of surveyed authors used the reporting guideline when designing their study. 12 With a proliferation of reporting checklists for every scenario, authors and readers of such reports are increasingly confused about the value of these checklists. As of 14 February 2022, 488 reporting checklists were registered with EQUATOR (Enhancing the QUAlity and Transparency Of health Research) and 111 were in development.
In the case of observational and randomised clinical research using EHRs and other structured data, the source of data, its manipulation, and underpinning governance are of critical importance to extrapolating results. Clarity is needed from a broad stakeholder perspective, providing a quality framework to enhance the design and application of clinical research that increasingly depends on these crucial new sources of data. This article reflects the joint work of a wide range of international stakeholders with a remit to improve the use of structured healthcare data. The programme was coordinated by the European Society of Cardiology, a non-profit organisation of healthcare professionals, and the BigData@Heart Consortium, a public-private partnership funded by the European Union Innovative Medicines Initiative. Our aim was to navigate opportunities and limitations, and to develop a framework for a broad audience of global stakeholders across all disease areas. The CODE-EHR framework seeks to realise the exciting opportunity that digitisation of health data affords to increase efficiency of healthcare systems, and improve the lives and wellbeing of patients.

Summary points
• Research using routinely collected structured healthcare data has the potential for major clinical impact but this requires a clear and transparent approach to describe data sources, linkage protocols, coding definitions, and validation of methods and results • A social license and public mandate are essential components of big data research that can provide societal benefit, addressing the concerns of participants, and ensuring data privacy and integrity • This paper describes the output of international stakeholder meetings for the use of structured healthcare data for research purposes, including patient representatives, clinicians, scientists, regulators, journal editors, and industry representatives • The CODE-EHR checklist provides a minimum standards framework to enhance research design and enable more effective use and dissemination of routine healthcare data for clinical research  representatives from the pharmaceutical industry, payers, leading  academic institutions, and professional societies (see acknowledgments). Development of the CODE-EHR framework was centred on two stakeholder meetings (7 July 2020 and 26 October 2020), consisting of presentations from key opinion leaders, followed by breakout sessions and plenaries to formulate statements on key topic areas. An iterative process with virtual work was used to achieve consensus positions, with a further meeting on 10 March 2022 to finalise this report. We aimed to develop pragmatic advice for the use of structured healthcare data within trials and observational studies that is not dependent on particular diseases, and that meets the expectations of stakeholders and the general public. Our objectives were to provide new direction to this increasingly important field in medicine, thereby enhancing the value of routinely collected data to improve future patient wellbeing. Detailed text on the current state-of-the-art for research using healthcare data, in addition to key challenges and limitations, was developed by the stakeholder group, supported by a writing committee. See appendix 1 in which we address the need for common standards to appraise the digital landscape (e.g, coding systems and the vital aspect of linkage), expand on current and future opportunities for the use of structured healthcare data, and show how a social license can lead to cocreation of research with a public health benefit. The key challenges and pathways for improvement are outlined in figure 1, which presents the process of structured healthcare data from initial notation to their potential use to enhance research and subsequently improve clinical practice.
The output of the stakeholder meetings and iterative discussions were condensed into four core central themes: technical process and data stewardship; data security and privacy; publications using structured healthcare data; and addressing the needs of regulators, reimbursement authorities, and clinical practice guidelines. Key statements and advisories from the consensus meetings are summarised in table 1.

Patient and public involvement
The CODE-EHR consensus approach has benefited from patient and public engagement throughout the development process, including representation from the European Society of Cardiology Patient Council and the European Heart Network, an alliance of foundations and associations supporting patients and representing patient interests. We describe a potential method for engagement of the public in future research that can constructively benefit research using big data ( fig 2).

CODE-EHR reporting framework
The path from structured healthcare data to clinical research output is complex. To support further development in a transparent way, stakeholder delegates reached consensus of the need for a set of minimum standards that authors could use as a tool to enhance design, reporting, and research output. The CODE-EHR Minimum Standards Framework presented in table 2 allows authors to report on how structured healthcare data were used in their research study (either in patient identification, disease phenotyping, or outcome derivation). Preferred standards indicate high level attainment of quality and can be used as a tool to improve the future trajectory of research. The checklist was created through an iterative process based on the stakeholder proposals and covers five key areas of enhanced transparency: how and why coding was performed; the process of constructing and linking datasets; clear definitions of both diseases and outcomes; the approach to analysis, including any computational methods; and demonstrating good data governance.
The framework aims to improve the quality of studies using structured healthcare data and to give confidence in their use for clinical decision making. See appendix 2 for a step-by-step approach to completion of the CODE-EHR reporting checklist, with relevant best practice examples. We also present a detailed description of the workflow that led to the checklist in appendix 3. Form versions of the checklist are provided in appendix 4 (word version) and appendix 5 (pdf version).

Discussion
Technological progress has led to rapid evolution in heath data systems with immediate impact in daily clinical practice. The potential for improving patient care and outcomes are clear, as are the challenges and limitations to achieving this objective. 22 Big data analytics now support large scale (and cost efficient) clinical research, with trials based within registries or the EHR itself now heralding a new era in evidence generation. These processes can be further developed by an accompanying social license and upskilling of knowledge for all stakeholders. Co-creation and shared decision making with patients and the public 23 is an important way to ensure appropriate data stewardship and privacy, leading to clinical impact through robust publications, regulatory decision making, and practice guidelines. In this paper, we have reported on a global multistakeholder process to develop a framework for researchers to use in the design and reporting of studies that include structured or coded healthcare data.
Digital health records are confusing for most researchers, with varying access to a myriad of different coding systems and classifications, and considerable differences across (and within) countries. Linkage of different health sources is often a core component of research based on structured healthcare data, and yet, this aspect is frequently overlooked when reporting such studies. Data privacy and the license for research can be severely compromised if linkage is not secure; hence, our focus is on transparency about how data are coded and linked, and how these approaches are openly discussed and documented. The stakeholder consensus meetings highlighted this area as a key concern for future research, supported by evidence that very few studies have provided sufficient detail to understand the research process. 24,25 The advent of registry and EHR-based randomised controlled trials 21,26,27 reinforces the imperative to see improvements in these areas, and to define new concepts for quality research. With the development of robust analytics supported by machine learning algorithms, 28 similar approaches have already been used to support artificial intelligence in healthcare. 29 A lack of transparency has a direct impact on the value of research using coded records, with issues arising for medical journals, regulators, clinical guideline writers, and more generally clinicians and the public. Bringing together the full range of these stakeholders, we aimed to take full advantage of recent technical developments to use structured healthcare data for research, to approach limitations directly, and to provide a framework across all medical fields where coded data can be used to improve patient care. A number of other overlapping themes emerged from the discussions, including the generation and retainment of public trust and confidence, and the need for coherent plans to deal with data security failures. Forethought about dealing with the harmonisation of data and the requirement for embedded validation methods were highlighted as key factors for future successful research. Similarly, education and communication are crucial for patients, citizens and healthcare professionals to effectively use the results from structured healthcare data studies.
The covid-19 pandemic has illustrated the need for rapid access to routine healthcare data to guide and monitor clinical care, and a clinical trial infrastructure to allow for immediate deployment. The digitalisation of healthcare, in particular the use of EHRs, offered the clinical community a unique opportunity to develop a learning healthcare system that could efficiently address the effects of covid-19. For example, information about the relationship between covid-19 and cardiovascular disease through linked EHR data that has combined primary care data, hospital data, death records, and covid-19 testing in more than 54 million people. 30 However, the pandemic also made clear the obstacles within various systems that restricted the sharing of data in almost real time that could direct care and help design clinical trials. Established governance, security, interoperability (system

Technical process and data stewardship
Research using structured healthcare data conducted according to the FAIR data principles of findability, accessibility, interoperability, and reusability. 13 Important considerations are transparency of who performed the coding, the coding system used, and the purpose of coding (reimbursement, diagnosis, etc.).
Clear and consistent identification and description of the sources of EHR data. Code lists and phenotyping algorithms can be described in detail and published, ideally before a study commences (for example on a coding repository or open-source archive). The minimum data required to meet the definitions will depend on the use case and can be reported to enhance transparency, in addition to the rationale for why certain decisions were made (for example, why one code was chosen over another, or what the effect would be if data collection periods were changed).
Validation at local, regional and global levels. Evidence demonstrating how algorithms have been externally validated, and also what quality assessment was performed on the research findings, for example on the accuracy, completeness and timeliness of the data. 14 Data quality rules can be used to assess coded data and allow comparisons across institutions and countries. 15 Reporting of methods used for data pre-processing and data linkage. This includes the methods used to assess the quality of linkage and the results of any data pre-processing and linkage (with provision of false positive and false negative rates, comparisons of linked and unlinked data, and any sensitivity analyses). 16 A flow diagram showing the processes for cleaning and linking different coding sources and datasets can aid understanding of the study design.
Reporting of the governance framework underpinning the study from a technical/data stewardship standpoint. This includes a clear purpose for data gathering and the parameters and time limit of consent, clear mechanisms for data processing ("what happens with my data"), and a description of what the data can and cannot be used for (i.e. the mandate given for research).

Data security and privacy
Working towards a new, sustainable mandate from the public and patients to use their health data may require moving away from abstract rules and regulations and towards more constructive governance, in which trust is a central concept. The trust of patients and the public in research institutions and in science is pivotal because of the liberties they give to researchers to use their data, which are the product of a social licence based on this trust.
Gaining this trust would benefit from understanding what society and stakeholders expect from scientists conducting health data research, with engagement of stakeholders from the concept stage. Co-creation of data governance based on inclusion of patient/public communities and dialogue with researchers is crucial for ethical and sustainable governance, and to translate expectations into scientific research and scientific output.
Researchers and big data consortia have to be mindful that trustworthiness comes with the duty to act in ethically-responsible ways. This concerns two areas; first is the competence in data handling (meaning that systems are in place to ensure data protection and there is a framework of rules and regulations for data sharing), and second is what motivates the data analysis. Ongoing dialogue can ensure that public values continue to be aligned with the governance structures of health data research projects. Questions arise as to how to measure success at implementing public values into research, and what levels of public support are sufficient to grant a mandate for data usage.
Complex organisational structures may be less important for this trust than is often asserted. Complicated rules and regulations may do more harm than good in establishing the conditions for public trust in big data health research to flourish, and as a result be counterproductive especially when a social license has not been adequately achieved. 17 Embracing values such as transparency, reciprocity, inclusivity and service to the common good. These values can be embedded into the governance framework of big data health research. 18 This calls for constructing a narrative that researchers and research consortia can be held accountable so that patients and the wider public are willing, and consistently willing, to place their trust in health research projects.
Governance could be aided by developing a framework for accountability. This includes clear distinctions between anonymised, pseudonymised and aggregate data along with plain language explanation to participants and users, and discrimination between primary and secondary use of data sources.

Publications using structured healthcare data
Accountability for the source of data and how the data have been collected (traceability). As with data security, a framework of accountability would enable editorial teams in medical journals to be aware of the technical processes prior to data analysis.
Sharing of data, codes and algorithms used to analyse datasets. Similar to the requirement for pre-registration of clinical trials and pre-publication of protocols, journals could restrict publication where the coding within a study is not shared.
Demonstration of data validity and robust analysis. The FDA and EMA already suggest independent checking or accreditation of data sources; this accreditation could be provided to editors to increase their confidence in data quality.

Continued
architecture that spans different EHR systems and healthcare providers), and phenotype definition, among other issues, limited access to routine EHR data especially in the first period of the pandemic. The CODE-EHR framework is intended to complement available reporting checklists. [31][32][33][34] Although existing checklists are aimed at transparency in the reporting of important methodological components of clinical research, the CODE-EHR framework is designed to ensure that a common set of minimum standards are applied across all research using structured healthcare data. This range includes observational studies and controlled trials, with the preferred standards giving the direction of research design for all future EHR studies. Additionally, the framework supports the wider implementation of good quality real world data research based on the FAIR data principles. 13 Researchers are advised to use the checklist during the design phase of their study to ensure that key criteria for successful research and research impact are already embedded. The process will aid journal editors, regulators, guideline writers, clinicians, and patients to better appreciate the underpinning value, and also the limitations of the study. Dissemination plans for CODE-EHR include discussion with journals to request authors complete the checklist when submitting relevant research, attaining full registration with the EQUATOR network, 35 and outreach via international digital health groups to engage their members and other relevant stakeholder organisations. After publication, the CODE-EHR framework will undergo a two year evaluation, including discussion with researchers using the approach, with a plan for iterative improvements to adapt to this rapidly developing field of medical research.

Conclusion
The CODE-EHR framework was designed by a multistakeholder panel to improve design and reporting of research studies using structured electronic healthcare data. Research using these data sources is a vital component of future healthcare evaluation and delivery and will take an increasingly important role in decisions by regulatory,

Workshop theme Key consensus statements and advisories
Balancing the speed of publication against requirements for data validation. Prompt publication (for example of results with immediate public health implication) needs to be balanced against validation of data sources to ensure authenticity.
Scientific advice committees with experts in big data analytics to aid journal editorial teams. The skill-set required in editors and reviewers for studies using structured healthcare data is not the same as having statistical or clinical trials experience; expertise in EHR data and respective coding systems could add value to the journal review process.
Widening gap between the knowledge of physicians and the advanced methodologies used in big data papers. Medical/graduate students and practising clinicians, as well as hospital managers and leadership, need training in health data management and analysis. This is important to build a digital workforce with increased capacity and capability to translate publications using new approaches to improve patient care. Further research may help explore cases in which EHR studies produce valuable evidence, and when they might be flawed. This will generate confidence in regulators for future EHR studies, and for guideline taskforces to appropriately appraise evidence.
Quality standards will help to ensure that the information recorded in EHR systems represents real events without bias. This will enable confidence that trials using EHRs can produce reliable results on efficacy and safety, and could include examination of the validity of both data sources and data analyses.
Source data validation to report on appropriate computational phenotypes. This could be supported by an independent adjudication committee to examine a subset of the EHR and confirm outcome events. The use of AI techniques could facilitate larger validation studies by automated extraction of supporting text from clinical notations. Such validation exercises can be pre-registered, for example in the form of a Study-Within-A-Trial. 20 Another possibility is for researchers to provide consented and anonymised gold standard cases to benchmark against, or for data from devices used to verify codes (such as lead fractures). The value of synthetic datasets for validation, which mimic real data, needs further exploration.
Mixed model approaches to collect data on particular endpoints. May be valuable for situations where the EHR does not reliably collect relevant data. For example, where patients and/or clinicians are asked for information, or data are collected via wearable devices or telemonitoring. In some cases, parallel monitoring of patients alongside the EHR study may provide additional confidence (for example to identify serious unexpected adverse events). Technological advances in EHR systems will help, such as the ability to retrieve EHR data on a daily basis to support clinical trials. 21 Taking advantage of the many real world data initiatives to support new research. Government agencies, regulators, charities and professional bodies have initiated programmes for better use of real-world data that can support further activity and dissemination.
governmental, and healthcare agencies, as well as clinicians and patients in every medical specialty. The CODE-EHR checklist asks for clarity on reporting and defines a set of minimum and preferred standards on the processes that underpin coding, dataset construction and linkage, disease and outcome definitions, analysis, and research governance. Iterative updates to this framework are expected to enhance research quality and value and to generate new pathways for impact using routinely collected healthcare data.

Contributors
DK and FWA are joint first authors with equal contribution, and act as guarantors of the work. Correspondence to FWA at f.w.asselbergs@ umcutrecht.nl. Each subsequent listed author was involved in the CODE-EHR consensus process and wrote or revised sections of text and revised the final manuscript for intellectual content. The corresponding authors attest that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

Supplementary data
Supplementary data is available at European Heart Journal online.

Acknowledgments
We are grateful for the work of the CODE-EHR international consensus. Delegates and attendees of the meetings were:

Summary declaration
Choose one from: One or more minimum standards not met OR All minimum standards met Number of preferred standards met: / 5 Directions for use: Research team: To complete the checklist, authors will need to consider these points during the design of the research to ensure that coding protocols and coding manuals are pre-published. Where applicable, it is advisable that all five minimum standards are met for an individual research study, whether observational or a controlled trial. If any component is not applicable to the study, the corresponding author can indicate why this is the case in the comment box. This checklist can accompany the article as a supplementary file on submission to the journal, with the ability for readers to review responses. A comment on the meeting of standards in the text of the method section is suggested, eg; "this study meets all five of the CODE-EHR minimum framework standards for the use of structured healthcare data in clinical research, with two out of five standards meeting preferred criteria https://doi.org/10.1093/eurheartj/ehac426"; OR "this study meets four out of five of the CODE-EHR minimum framework standards for the use of structured healthcare data in clinical research; one of the five minimum standards was not met as coding schemes were not specified prior to analysis https://doi.org/10.1093/eurheartj/ ehac426." Note, easy to complete form versions of this checklist are available in appendix 4 (word version) and appendix 5 (pdf version), and also at https://www.escardio.org/bigdata. Research appraisers (patients, clinicians, regulators, guideline task forces): Where applicable, it is advisable that all five minimum standards are met for the research study to be considered robust.